Comparison of Similarity Coefficients for Clustering and Compound Selection

نویسندگان

  • Maciej Haranczyk
  • John D. Holliday
چکیده

Recent studies into the use of a selection of similarity coefficients, when applied to searches of chemical databases represented by binary fingerprints, have shown considerable variation in their retrieval performance and in the sets of compounds being retrieved. The main factor influencing performance is the density distribution of the bitstrings for the active class, a feature which is closely related to molecular size. If this is the case when these coefficients are applied to similarity searches, then we would expect considerable variation in performance when applied to dissimilarity methods, namely clustering and compound selection. Here we report on several studies which have been undertaken to investigate the relative performance of 13 association and correlation coefficients, which have been shown to exhibit complementary performance in similarity searches, when applied to hierarchical and nonhierarchical clustering methods and to a compound selection methodology. Results suggest that the correlation coefficients perform consistently well for clustering and compound selection, as does the Baroni-Urbani/Buser association coefficient. Surprisingly, these often outperform the Tanimoto coefficient, while the Simple Match (effectively the complement of the Squared Euclidean Distance) performs very poorly.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset

Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to weights of attributes is proposed. A comprehensive review of the electrocardiogram signal classification and segmentation methodologies indicates that algorithms which are able to effectively handle the nonstationary and uncertainty of the signals should be used for ECG analysis. Ex...

متن کامل

An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and how to measure their similarity is a core part of MTS analyzing process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...

متن کامل

A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach

In recent years, the advancement of information gathering technologies such as GPS and GSM networks have led to huge complex datasets such as time series and trajectories. As a result it is essential to use appropriate methods to analyze the produced large raw datasets. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...

متن کامل

Signal processing approaches as novel tools for the clustering of N-acetyl-β-D-glucosaminidases

Nowadays, the clustering of proteins and enzymes in particular, are one of the most popular topics in bioinformatics. Increasing number of chitinase genes from different organisms and their sequences have beenidentified. So far, various mathematical algorithms for the clustering of chitinase genes have been used butmost of them seem to be confusing and sometimes insufficient. In the...

متن کامل

Machine Cell Formation Based on a New Similarity Coefficient

One of the designs of cellular manufacturing systems (CMS) requires that a machine population be partitioned into machine cells. Numerous methods are available for clustering machines into machine cells. One method involves using a similarity coefficient. Similarity coefficients between machines are not absolute, and they still need more attention from researchers. Although there are a number o...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Journal of chemical information and modeling

دوره 48 3  شماره 

صفحات  -

تاریخ انتشار 2008